99 research outputs found
A Linear Algebra Approach to Fast DNA Mixture Analysis Using GPUs
Analysis of DNA samples is an important step in forensics, and the speed of
analysis can impact investigations. Comparison of DNA sequences is based on the
analysis of short tandem repeats (STRs), which are short DNA sequences of 2-5
base pairs. Current forensics approaches use 20 STR loci for analysis. The use
of single nucleotide polymorphisms (SNPs) has utility for analysis of complex
DNA mixtures. The use of tens of thousands of SNPs loci for analysis poses
significant computational challenges because the forensic analysis scales by
the product of the loci count and number of DNA samples to be analyzed. In this
paper, we discuss the implementation of a DNA sequence comparison algorithm by
re-casting the algorithm in terms of linear algebra primitives. By developing
an overloaded matrix multiplication approach to DNA comparisons, we can
leverage advances in GPU hardware and algoithms for Dense Generalized
Matrix-Multiply (DGEMM) to speed up DNA sample comparisons. We show that it is
possible to compare 2048 unknown DNA samples with 20 million known samples in
under 6 seconds using a NVIDIA K80 GPU.Comment: Accepted for publication at the 2017 IEEE High Performance Extreme
Computing conferenc
Lincoln AI Computing Survey (LAICS) Update
This paper is an update of the survey of AI accelerators and processors from
past four years, which is now called the Lincoln AI Computing Survey - LAICS
(pronounced "lace"). As in past years, this paper collects and summarizes the
current commercial accelerators that have been publicly announced with peak
performance and peak power consumption numbers. The performance and power
values are plotted on a scatter graph, and a number of dimensions and
observations from the trends on this plot are again discussed and analyzed.
Market segments are highlighted on the scatter plot, and zoomed plots of each
segment are also included. Finally, a brief description of each of the new
accelerators that have been added in the survey this year is included.Comment: 7 pages, 6 figures, 2023 IEEE High Performance Extreme Computing
(HPEC) conference, September 202
AI and ML Accelerator Survey and Trends
This paper updates the survey of AI accelerators and processors from past
three years. This paper collects and summarizes the current commercial
accelerators that have been publicly announced with peak performance and power
consumption numbers. The performance and power values are plotted on a scatter
graph, and a number of dimensions and observations from the trends on this plot
are again discussed and analyzed. Two new trends plots based on accelerator
release dates are included in this year's paper, along with the additional
trends of some neuromorphic, photonic, and memristor-based inference
accelerators.Comment: 10 pages, 4 figures, 2022 IEEE High Performance Extreme Computing
(HPEC) Conference. arXiv admin note: substantial text overlap with
arXiv:2009.00993, arXiv:2109.0895
Streaming Graph Challenge: Stochastic Block Partition
An important objective for analyzing real-world graphs is to achieve scalable
performance on large, streaming graphs. A challenging and relevant example is
the graph partition problem. As a combinatorial problem, graph partition is
NP-hard, but existing relaxation methods provide reasonable approximate
solutions that can be scaled for large graphs. Competitive benchmarks and
challenges have proven to be an effective means to advance state-of-the-art
performance and foster community collaboration. This paper describes a graph
partition challenge with a baseline partition algorithm of sub-quadratic
complexity. The algorithm employs rigorous Bayesian inferential methods based
on a statistical model that captures characteristics of the real-world graphs.
This strong foundation enables the algorithm to address limitations of
well-known graph partition approaches such as modularity maximization. This
paper describes various aspects of the challenge including: (1) the data sets
and streaming graph generator, (2) the baseline partition algorithm with
pseudocode, (3) an argument for the correctness of parallelizing the Bayesian
inference, (4) different parallel computation strategies such as node-based
parallelism and matrix-based parallelism, (5) evaluation metrics for partition
correctness and computational requirements, (6) preliminary timing of a
Python-based demonstration code and the open source C++ code, and (7)
considerations for partitioning the graph in streaming fashion. Data sets and
source code for the algorithm as well as metrics, with detailed documentation
are available at GraphChallenge.org.Comment: To be published in 2017 IEEE High Performance Extreme Computing
Conference (HPEC
Layer-Parallel Training with GPU Concurrency of Deep Residual Neural Networks via Nonlinear Multigrid
A Multigrid Full Approximation Storage algorithm for solving Deep Residual
Networks is developed to enable neural network parallelized layer-wise training
and concurrent computational kernel execution on GPUs. This work demonstrates a
10.2x speedup over traditional layer-wise model parallelism techniques using
the same number of compute units.Comment: 7 pages, 6 figures, 27 citations. Accepted to 2020 IEEE High
Performance Extreme Computing Conference - Outstanding Paper Awar
Performance Measurements of Supercomputing and Cloud Storage Solutions
Increasing amounts of data from varied sources, particularly in the fields of
machine learning and graph analytics, are causing storage requirements to grow
rapidly. A variety of technologies exist for storing and sharing these data,
ranging from parallel file systems used by supercomputers to distributed block
storage systems found in clouds. Relatively few comparative measurements exist
to inform decisions about which storage systems are best suited for particular
tasks. This work provides these measurements for two of the most popular
storage technologies: Lustre and Amazon S3. Lustre is an open-source, high
performance, parallel file system used by many of the largest supercomputers in
the world. Amazon's Simple Storage Service, or S3, is part of the Amazon Web
Services offering, and offers a scalable, distributed option to store and
retrieve data from anywhere on the Internet. Parallel processing is essential
for achieving high performance on modern storage systems. The performance tests
used span the gamut of parallel I/O scenarios, ranging from single-client,
single-node Amazon S3 and Lustre performance to a large-scale, multi-client
test designed to demonstrate the capabilities of a modern storage appliance
under heavy load. These results show that, when parallel I/O is used correctly
(i.e., many simultaneous read or write processes), full network bandwidth
performance is achievable and ranged from 10 gigabits/s over a 10 GigE S3
connection to 0.35 terabits/s using Lustre on a 1200 port 10 GigE switch. These
results demonstrate that S3 is well-suited to sharing vast quantities of data
over the Internet, while Lustre is well-suited to processing large quantities
of data locally.Comment: 5 pages, 4 figures, to appear in IEEE HPEC 201
Enabling On-Demand Database Computing with MIT SuperCloud Database Management System
The MIT SuperCloud database management system allows for rapid creation and
flexible execution of a variety of the latest scientific databases, including
Apache Accumulo and SciDB. It is designed to permit these databases to run on a
High Performance Computing Cluster (HPCC) platform as seamlessly as any other
HPCC job. It ensures the seamless migration of the databases to the resources
assigned by the HPCC scheduler and centralized storage of the database files
when not running. It also permits snapshotting of databases to allow
researchers to experiment and push the limits of the technology without
concerns for data or productivity loss if the database becomes unstable.Comment: 6 pages; accepted to IEEE High Performance Extreme Computing (HPEC)
conference 2015. arXiv admin note: text overlap with arXiv:1406.492
Lustre, Hadoop, Accumulo
Data processing systems impose multiple views on data as it is processed by
the system. These views include spreadsheets, databases, matrices, and graphs.
There are a wide variety of technologies that can be used to store and process
data through these different steps. The Lustre parallel file system, the Hadoop
distributed file system, and the Accumulo database are all designed to address
the largest and the most challenging data storage problems. There have been
many ad-hoc comparisons of these technologies. This paper describes the
foundational principles of each technology, provides simple models for
assessing their capabilities, and compares the various technologies on a
hypothetical common cluster. These comparisons indicate that Lustre provides 2x
more storage capacity, is less likely to loose data during 3 simultaneous drive
failures, and provides higher bandwidth on general purpose workloads. Hadoop
can provide 4x greater read bandwidth on special purpose workloads. Accumulo
provides 10,000x lower latency on random lookups than either Lustre or Hadoop
but Accumulo's bulk bandwidth is 10x less. Significant recent work has been
done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo
to be combined in different ways.Comment: 6 pages; accepted to IEEE High Performance Extreme Computing
conference, Waltham, MA, 201
Lessons Learned from a Decade of Providing Interactive, On-Demand High Performance Computing to Scientists and Engineers
For decades, the use of HPC systems was limited to those in the physical
sciences who had mastered their domain in conjunction with a deep understanding
of HPC architectures and algorithms. During these same decades, consumer
computing device advances produced tablets and smartphones that allow millions
of children to interactively develop and share code projects across the globe.
As the HPC community faces the challenges associated with guiding researchers
from disciplines using high productivity interactive tools to effective use of
HPC systems, it seems appropriate to revisit the assumptions surrounding the
necessary skills required for access to large computational systems. For over a
decade, MIT Lincoln Laboratory has been supporting interactive, on-demand high
performance computing by seamlessly integrating familiar high productivity
tools to provide users with an increased number of design turns, rapid
prototyping capability, and faster time to insight. In this paper, we discuss
the lessons learned while supporting interactive, on-demand high performance
computing from the perspectives of the users and the team supporting the users
and the system. Building on these lessons, we present an overview of current
needs and the technical solutions we are building to lower the barrier to entry
for new users from the humanities, social, and biological sciences.Comment: 15 pages, 3 figures, First Workshop on Interactive High Performance
Computing (WIHPC) 2018 held in conjunction with ISC High Performance 2018 in
Frankfurt, German
- …